This exercise involves participating in one of Kaggle's competitions, the Store Item Demand Forecasting Challenge.
The instructions suggest comparing different methods: some based on traditional statistical techniques such as ARIMA models, and others on techniques such as XGBoost and neural networks. As a team we decided to explore all three possibilities; this report documents the results obtained with multivariate ARIMA models.
The criterion used to measure prediction performance is SMAPE (symmetric mean absolute percentage error).
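For reference, the metric can be written as a short R function (a sketch of the usual Kaggle definition, not the competition's own scoring code; pairs where both the actual value and the forecast are zero are counted as zero error):

```r
# SMAPE as commonly defined on Kaggle: 100/n * sum(2|F - A| / (|A| + |F|)).
# Pairs where both actual and forecast are 0 are counted as zero error.
smape <- function(actual, forecast) {
  denom <- abs(actual) + abs(forecast)
  ratio <- ifelse(denom == 0, 0, 2 * abs(forecast - actual) / denom)
  100 * mean(ratio)
}
smape(c(10, 20, 30), c(10, 20, 30))  # a perfect forecast scores 0
```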
We then read the data and convert some column types to facilitate data handling.
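A minimal sketch of those conversions, using an inline two-row sample instead of the actual competition file (the real `train.csv` has the same four columns):

```r
# Sketch of the type conversions on an inline two-row sample (the real
# train.csv from the competition has the same four columns).
train <- data.frame(
  date  = c("2013-01-01", "2013-01-02"),
  store = c(1L, 1L),
  item  = c(1L, 1L),
  sales = c(13L, 11L),
  stringsAsFactors = FALSE
)
train$date  <- as.Date(train$date)     # character -> Date
train$store <- as.factor(train$store)  # store id as a factor
train$item  <- as.factor(train$item)   # item id as a factor
str(train)
# For the real data: train <- read.csv("train.csv"), then the same casts.
```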
First we seek to make the time series stationary, and for this we consider two approaches. The first is to treat each store (with its sales aggregated over the 50 items) as a time series; the second is to treat each product (aggregated over the 10 stores) as a time series in its own right.
The following graphs show the original series (lag = 0), first for the store-level approach:
For brevity we only show the results for the first store and for item 30.
Store 1
Then the original series (lag = 0) for the second approach:
Item 30
Up to this point we found that the data show a trend both by product and by store. We proceed to check whether the trend is of second order by taking first differences under both approaches.
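The differencing step itself is just `diff()` in R; the toy example below (simulated data, not the competition series) shows why first differences remove a linear trend:

```r
# First differences remove a linear trend: if x(t) = a + b*t + noise,
# diff(x) fluctuates around the constant slope b (simulated data).
set.seed(1)
day <- 1:200
x   <- 5 + 0.3 * day + rnorm(200, sd = 0.5)  # trending toy series
dx  <- diff(x)                               # lag-1 differences, length 199
mean(dx)                                     # close to the slope 0.3
```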
Store sales (with lag of 1, first differences).
Sales of a product across stores (first differences, lag = 1)
Seasonality! :D
The first differences above show that the trend varies by product and by store; in the case of stores, however, the trend is more homogeneous.
Next we visualize different lags under both approaches; below we show the trends of the series for lag = 37:
We proceed to plot item sales with lag = 37.
From all the sets of graphs above we can intuit that an AR model fits the data; we rule out MA models because for the different lags (0, 1, 2, …, 37) the series are not approximately stationary.
At this point, fitting 50 univariate ARIMA models (one to forecast each item's sales) or 10 univariate ARIMA models (one per store) yields a performance (measured with SMAPE) of approximately 53.1. The benefit gained with 50 series instead of 10 does not justify the computational cost.
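The per-series univariate fit can be sketched with base R's `arima()` on a simulated series standing in for one store's sales (the AR order and the 90-day horizon below are illustrative assumptions, not the exact values we used):

```r
# Univariate sketch with base R: fit an AR(1)-with-mean via stats::arima
# on a simulated series (standing in for one store's daily sales),
# then forecast a 90-day horizon with predict().
set.seed(42)
y <- arima.sim(model = list(ar = 0.7), n = 500) + 50  # toy "sales" series

fit <- arima(y, order = c(1, 0, 0))   # AR(1), no differencing, no MA part
fc  <- predict(fit, n.ahead = 90)     # point forecasts in fc$pred
length(fc$pred)                       # 90 forecasts, one per day
```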
Given that the distribution of the values to be predicted (sales) is positively skewed (as the following graphs show), one option was a logarithmic transformation to reduce this skew. However, the prediction error of the autoregressive models on the 50 or 10 series was around 72.9, so we discarded the idea of transforming only the sales.
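The transformation we tried can be sketched as follows; `log1p`/`expm1` (rather than plain `log`/`exp`) avoid problems on days with zero sales (toy values, not the competition data):

```r
# Log transform for positively skewed sales; log1p/expm1 handle zeros
# safely, unlike plain log/exp (toy values).
sales <- c(0, 12, 18, 25, 60, 140)   # right-skewed toy sales
z     <- log1p(sales)                # model on log(1 + sales)
back  <- expm1(z)                    # invert after forecasting
all.equal(back, sales)               # round trip recovers the data
```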
## [1] "date" "store" "item" "sales"
Finally, we decided to consider a multivariate AR model; we did not consider multivariate MA models because, to date, we have not found an implementation of them in the R environment. We also regarded that option as less practical: in our univariate results the ARIMA models varied greatly in their moving-average part, so we expected them to have little power of generalization (i.e., forecasting).
We therefore consider a multivariate time-series model, in which the sales of an item, or store, depend autoregressively on the other items or stores. To avoid overfitting, instead of 50 series (one per item) we consider 10 series (one for each store's sales).
We then have the random vector of sales across the 10 stores:
\(\begin{pmatrix} x_1(t+1) \\ x_2(t+1) \\ \vdots \\ x_{10}(t+1) \end{pmatrix} = \begin{pmatrix} \theta_{1,1}x_1(t) + \theta_{1,2}x_2(t) + \dots + \theta_{1,10}x_{10}(t) \\ \theta_{2,1}x_1(t) + \theta_{2,2}x_2(t) + \dots + \theta_{2,10}x_{10}(t) \\ \vdots \\ \theta_{10,1}x_1(t) + \theta_{10,2}x_2(t) + \dots + \theta_{10,10}x_{10}(t) \end{pmatrix}\)
where \(x_{i}(t)\) represents the sales of the \(i\)-th store on day \(t\). With no lags we only need to estimate the 100 parameters \(\theta_{i,j}\) with \(i,j \in \{1,2,\dots, 10\}\). However, our estimates led us to a vector autoregressive model with 49 lags, giving the following model, which requires estimating 4900 parameters.
\(\begin{pmatrix} x_1(t+1) \\ x_2(t+1) \\ \vdots \\ x_{10}(t+1) \end{pmatrix} = \begin{pmatrix} \theta_{1,1}^0x_1(t) + \theta_{1,1}^1x_1(t-1) + \dots + \theta_{1,1}^{48}x_1(t-48) + \dots + \theta_{1,10}^0x_{10}(t) + \theta_{1,10}^1x_{10}(t-1) + \dots + \theta_{1,10}^{48}x_{10}(t-48) \\ \theta_{2,1}^0x_1(t) + \theta_{2,1}^1x_1(t-1) + \dots + \theta_{2,1}^{48}x_1(t-48) + \dots + \theta_{2,10}^0x_{10}(t) + \theta_{2,10}^1x_{10}(t-1) + \dots + \theta_{2,10}^{48}x_{10}(t-48) \\ \vdots \\ \theta_{10,1}^0x_1(t) + \theta_{10,1}^1x_1(t-1) + \dots + \theta_{10,1}^{48}x_1(t-48) + \dots + \theta_{10,10}^0x_{10}(t) + \theta_{10,10}^1x_{10}(t-1) + \dots + \theta_{10,10}^{48}x_{10}(t-48) \end{pmatrix}\)
where \(x_{i}(t)\) represents the sales of the \(i\)-th store on day \(t\), and \(\theta_{i,j}^k\) is the coefficient of the \(k\)-th lag of store \(j\) in the \(i\)-th equation. Each of the 10 equations therefore has \(10 \times 49 = 490\) parameters, for 4900 parameters in total.
We fit a vector autoregressive model of order 49, \(VAR(49)\).
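The fit can be sketched in base R as an ordinary least-squares problem, one equation per store. The toy example below uses 2 simulated series and p = 2 lags instead of our 10 series and p = 49, but the construction (embed the lags, regress each component on them) is the same; in practice a package such as `vars` (its `VAR()` function) performs this fit directly.

```r
# Least-squares VAR sketch in base R: 2 simulated series, p = 2 lags
# (the report's model is the same construction with 10 series and p = 49,
# i.e. 10 * 10 * 49 = 4900 coefficients).
set.seed(7)
n <- 1000
X <- matrix(0, n, 2)
X[1, ] <- rnorm(2, sd = 0.1)
for (i in 2:n) X[i, ] <- 0.5 * X[i - 1, ] + rnorm(2, sd = 0.1)

p <- 2
E <- embed(X, p + 1)   # row t: (x(t), x(t-1), x(t-2)), 2 columns per vector
Y <- E[, 1:2]          # responses x(t)
Z <- E[, -(1:2)]       # regressors: the p lagged vectors
B <- qr.solve(Z, Y)    # least-squares coefficients, one column per equation
dim(B)                 # 4 x 2: (2 series * 2 lags) rows, 2 equations
```

Here `B[1, 1]` estimates the own-lag-1 coefficient of the first series, which the simulation set to 0.5.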
And finally we make the predictions for the whole test set:
Forecast example:
## id sales
## 1 0 47
## 2 1 56
## 3 2 67
## 4 3 53
## 5 4 67
## 6 5 51
Although our current model requires estimating 4900 parameters, it seems reasonable to validate the assumptions that a vector autoregression requires, such as that the errors have zero mean, i.e. \(E(e_t) = 0 \ \forall t\) (in our case the mean of the errors is 0.0010154).
Another assumption is that the errors of different equations are uncorrelated across series, i.e. \(cor(e_i(t), e_j(t)) = 0\) for \(i \neq j\) and all \(t\). In our case this does not hold, and the correlation matrix is the following (far from the identity):
Finally, the assumption that the errors are serially uncorrelated within each component of the vector does not hold either, as the following per-equation correlations of the errors show:
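The three residual checks above amount to a few lines of R; the sketch below applies them to a toy residual matrix with 3 columns standing in for our 10 equations (with well-behaved iid residuals, so all three checks pass here, unlike for our model):

```r
# Toy residual diagnostics: a 300 x 3 matrix of iid residuals standing in
# for the T x 10 matrix of VAR residuals in the report.
set.seed(3)
resid_mat <- matrix(rnorm(300 * 3), ncol = 3)

colMeans(resid_mat)                        # check 1: means ~ 0 per equation
cor(resid_mat)                             # check 2: ~ identity across equations
acf(resid_mat[, 1], plot = FALSE)$acf[2]   # check 3: lag-1 autocorrelation ~ 0
```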
So our fitted vector autoregression of order 49 is far from fulfilling the statistical assumptions needed for inference; nevertheless, we achieved an error of 49.61861 in the challenge.
A future line of work is to smooth the original series (the 10 series obtained when considering the stores separately, or the 50 series obtained when considering the items separately), but so far the implementations in the smooth package throw errors of up to 200 as measured with SMAPE.
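As a stopgap that avoids the smooth package entirely, a centred moving average with base R's `stats::filter` could be tried (toy series with a weekly pattern; the 7-day window is an illustrative assumption, not a tuned choice):

```r
# Centred 7-day moving average with stats::filter, on a toy series with a
# weekly pattern; the window length is an illustrative assumption.
set.seed(9)
x  <- 50 + 10 * sin(2 * pi * (1:365) / 7) + rnorm(365, sd = 5)
sm <- stats::filter(x, rep(1 / 7, 7), sides = 2)  # NA at the 3 edge days each side
sum(is.na(sm))   # 6 boundary NAs
```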